library(here) # for reproducible paths
library(SingleCellExperiment)
library(scater) # For qc and visualisation
library(scran) # For normalisation
library(Matrix) # For log transorming the raw data
library(ggplot2) # To add titles to plots
library(batchelor) # Mnn, for batch correction
# Adapted function from VISION to log tranform sparse matrix
# I could not download the package
matLog2 <- function(spmat, scale = FALSE, scaleFactor = 1e6) {
if (scale == TRUE) {
spmat <- t( t(spmat) / colSums(spmat)) * scaleFactor
}
if (is(spmat, "sparseMatrix")) {
matsum <- summary(spmat)
logx <- log2(matsum$x + 1)
logmat <- sparseMatrix(i = matsum$i, j = matsum$j,
x = logx, dims = dim(spmat),
dimnames = dimnames(spmat))
} else {
logmat <- log2(spmat + 1)
}
return(logmat)
}
In order to correct for systematic differences in sequencing coverage between libraries we will normalise the dataset. This involves dividing all counts for each cell by a cell-specific scaling factor, often called a “size factor” (Anders and Huber 2010). The assumption here is that any cell-specific bias (e.g., in capture or amplification efficiency) affects all genes equally via scaling of the expected mean count for that cell. The size factor for each cell represents the estimate of the relative bias in that cell, so division of its counts by its size factor should remove that bias.
Specifically we will used the deconvolution method available in the scran package. This method allows to take in consideration the composition bias between samples (Lun et al., 2016)
# Only compute if first time
if (!(file.exists(here("processed", "sce_norm.RDS")))) {
sce <- readRDS(here("processed", "sce_QC.RDS"))
# For reproducibility
set.seed(100)
# Quick clustering to pool samples together and deal with 0 counts
quick_clusters <- quickCluster(sce)
# Calculate size factors
sce <- computeSumFactors(sce, cluster = quick_clusters, min.mean = 0.1)
# Check that there are not negative size factors
summary(sizeFactors(sce))
# Apply size factors and log transform them
sce <- logNormCounts(sce)
# Also log normalise the raw counts
assay(sce, "logcounts_raw") <- matLog2(counts(sce))
saveRDS(sce, here("processed", "sce_norm.RDS"))
} else{
sce <- readRDS(here("processed", "sce_norm.RDS"))
}
On top of normalisation the data is also log-transformed. The log-transformation is useful as differences in the log-values represent log-fold changes in expression. Or in other words, which is more interesting - a gene that is expressed at an average count of 50 in cell type A and 10 in cell type B, or a gene that is expressed at an average count of 1100 in A and 1000 in B? Log-transformation focuses on the former by promoting contributions from genes with strong relative differences.
Variable-level metrics are computed by the getVarianceExplained() function (before and after normalization). This calculates the percentage of variance of each gene’s expression that is explained by each variable in the colData of the SingleCellExperiment object. We can then use this to determine which experimental factors are contributing most to the variance in expression. This is useful for diagnosing batch effects or to quickly verify that a treatment has an effect.
The percentage of variance explained by a factor is on the x axis, and in the y axis there is the density of the R-squared values across all genes.
The “total” label is the total number of molecules, that correlates with the detected number of genes, “detected”.
Before normalisation it is expected that most variance will be explained by the sequencing depth, i.e. the total number of umis and the total number of genes
# Before normalisation
# Only compute if first time
if (!(file.exists(here("processed", "variance_explained.RDS")))) {
# Calculate the matrix (time consuming step)
var <- getVarianceExplained(
sce,
exprs_values = "logcounts_raw",
variables = c(
"tissue",
"chip",
"Sample",
"age",
"subsets_mt_percent",
"detected",
"total"
)
)
saveRDS(var, here("processed", "variance_explained.RDS"))
#If not just load created object
} else {
var <- readRDS(here("processed", "variance_explained.RDS"))
}
plotExplanatoryVariables(var)
## Warning: Removed 126 rows containing non-finite values (stat_density).
We can see how there is less variance explained now by factors such as the detected genes or the number of counts
# After normalisation
if (!(file.exists(here("processed", "variance_explained_norm.RDS")
))) {
var_norm <- getVarianceExplained(
sce,
variables = c(
"tissue",
"chip",
"Sample",
"age",
"subsets_mt_percent",
"detected",
"total"
)
)
saveRDS(var_norm, here("processed", "variance_explained_norm.RDS"))
} else{
var_norm <- readRDS(here("processed", "variance_explained_norm.RDS"))
}
plotExplanatoryVariables(var_norm)
## Warning: Removed 126 rows containing non-finite values (stat_density).
# Again after normalisation but other parametres
if (!(file.exists(
here("processed", "variance_explained_norm_2.RDS")
))) {
var_norm_2 <- getVarianceExplained(
sce,
variables = c(
"tissue",
"chip",
"age",
"genotype",
"mouse",
"total"
)
)
saveRDS(var_norm_2, here("processed", "variance_explained_norm_2.RDS"))
} else{
var_norm_2 <- readRDS(here("processed", "variance_explained_norm_2.RDS"))
}
plotExplanatoryVariables(var_norm_2)
## Warning: Removed 108 rows containing non-finite values (stat_density).
Another way to assess the variance is by a plot with a PCA plot. Here again we can see how the sequencing depth(sum) explains most of the variance before the normalisation (49%)
raw <- runPCA(sce, exprs_values = "logcounts_raw")
plotPCA(raw, colour_by= "chip", size_by="sum") + ggtitle("Before normalisation")
sce <- runPCA(sce)
plotPCA(sce, colour_by= "chip", size_by="sum") + ggtitle("After normalisation")
plotPCA(sce, colour_by= "chip", point_size=0.1) +
ggtitle("After normalisation, small dots")
Combat was shown to outperform other batch correction methods for simple batch correction (Buttner et. al, 2019). However, this will also regress other biological differences that are not well balanced between batches. Integration techniques account for this fact, with the downside that it can lead to over-correction due to increased degrees of freedom of these non-linear methods.
We use a merge tree, useful for merging together batches that are known to be more closely related before attempting difficult merges involving more dissimilar batches.
if (!(file.exists(
here("processed", "sce_corrected.RDS")
))) {
set.seed(100)
sce <- correctExperiments(sce,
batch = factor(sce$chip),
PARAM = FastMnnParam(
merge.order = list(
list(list("3","5"), list("4","6")),
list("1","2")
)
)
)} else {
sce <- readRDS(here("processed", "sce_corrected.RDS"))
}
I have never seen this plot used after batch correction and I am not 100% sure ( and I hope) it is incorrect.
# Only compute if first time
if (!(file.exists(here("processed", "variance_explained_corrected.RDS")))) {
# Log transfrom the corrected values
assay(sce, "log_reconstructed") <- matLog2(assay(sce, "reconstructed"))
# Calculate the matrix (time consuming step)
var_corrected <- getVarianceExplained(
sce,
exprs_values = "log_reconstructed",
variables = c(
"tissue",
"chip",
"age",
"genotype",
"mouse",
"total"
)
)
saveRDS(var_corrected, here("processed", "variance_explained_corrected.RDS"))
#If not just load created object
} else{
var_corrected <- readRDS(here("processed", "variance_explained_corrected.RDS"))
}
plotExplanatoryVariables(var_corrected)
## Warning: Removed 108 rows containing non-finite values (stat_density).
plotReducedDim(sce, colour_by= "chip", point_size=0.1, dimred = "corrected") +
ggtitle("After batch correction, small dots")
One useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is in the metadata, which contains a matrix of the variance lost in each batch (column) at each merge step (row).
Large proportions of lost variance (>10%) suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace. In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern.
metadata(sce)$merge.info$lost.var
## 1 2 3 4 5 6
## [1,] 0.04721138 0.05320356 0.000000000 0.00000000 0.000000000 0.00000000
## [2,] 0.00000000 0.00000000 0.000000000 0.01051195 0.000000000 0.01101676
## [3,] 0.00000000 0.00000000 0.007692128 0.00000000 0.008291534 0.00000000
## [4,] 0.00000000 0.00000000 0.009484231 0.01099193 0.009133937 0.01440139
## [5,] 0.05004264 0.05688210 0.044680898 0.04679437 0.041414986 0.04322228
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] batchelor_1.6.2 Matrix_1.3-2
## [3] scran_1.18.5 scater_1.18.6
## [5] ggplot2_3.3.3 SingleCellExperiment_1.12.0
## [7] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.2
## [11] IRanges_2.24.1 S4Vectors_0.28.1
## [13] BiocGenerics_0.36.0 MatrixGenerics_1.2.1
## [15] matrixStats_0.58.0 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.4 Rcpp_1.0.6
## [3] rsvd_1.0.3 lattice_0.20-41
## [5] rprojroot_2.0.2 digest_0.6.27
## [7] utf8_1.1.4 R6_2.5.0
## [9] evaluate_0.14 highr_0.8
## [11] bluster_1.0.0 pillar_1.5.1
## [13] sparseMatrixStats_1.2.1 zlibbioc_1.36.0
## [15] rlang_0.4.10 irlba_2.3.3
## [17] rmarkdown_2.7 labeling_0.4.2
## [19] BiocNeighbors_1.8.2 statmod_1.4.35
## [21] BiocParallel_1.24.1 stringr_1.4.0
## [23] igraph_1.2.6 RCurl_1.98-1.2
## [25] munsell_0.5.0 beachmat_2.6.4
## [27] DelayedArray_0.16.2 compiler_4.0.4
## [29] vipor_0.4.5 BiocSingular_1.6.0
## [31] xfun_0.21 pkgconfig_2.0.3
## [33] ggbeeswarm_0.6.0 htmltools_0.5.1.1
## [35] tibble_3.1.0 gridExtra_2.3
## [37] GenomeInfoDbData_1.2.4 edgeR_3.32.1
## [39] fansi_0.4.2 viridisLite_0.3.0
## [41] crayon_1.4.1 withr_2.4.1
## [43] bitops_1.0-6 grid_4.0.4
## [45] gtable_0.3.0 lifecycle_1.0.0
## [47] magrittr_2.0.1 dqrng_0.2.1
## [49] scales_1.1.1 stringi_1.5.3
## [51] scuttle_1.0.4 farver_2.1.0
## [53] XVector_0.30.0 viridis_0.5.1
## [55] limma_3.46.0 DelayedMatrixStats_1.12.3
## [57] ellipsis_0.3.1 vctrs_0.3.6
## [59] tools_4.0.4 glue_1.4.2
## [61] beeswarm_0.3.1 ResidualMatrix_1.0.0
## [63] yaml_2.2.1 colorspace_2.0-0
## [65] knitr_1.31